Allele frequency deviation (AFD) as a new prognostic model to predict overall survival in lung adenocarcinoma (LUAD)

Background: Lung adenocarcinoma (LUAD) remains one of the most aggressive malignancies worldwide, with a high mortality rate. Molecular biological analysis and bioinformatics have become central to studies that identify biomarkers for predicting survival in patients with LUAD. In this study, we sought to establish a new prognostic model by developing an algorithm to calculate the allele frequency deviation (AFD), which may in turn assist in the early diagnosis and prediction of clinical outcomes in LUAD.

Method: First, a new algorithm was developed to calculate AFD from a whole-exome sequencing (WES) dataset. AFD was then measured for 102 patients, and its predictive power was assessed using Kaplan–Meier analysis, receiver operating characteristic (ROC) curves, and the area under the curve (AUC). Finally, multivariable Cox regression analyses were conducted to evaluate whether AFD is an independent prognostic tool.

Result: The Kaplan–Meier analysis showed that AFD effectively segregated patients with LUAD into high-AFD-value and low-AFD-value risk groups (hazard ratio HR = 1.125, 95% confidence interval CI 1.001–1.26, p = 0.04) in the training group. Moreover, the overall survival (OS) of patients in the high-AFD-value group was significantly shorter than that of patients in the low-AFD-value group, with a 42.8% and 10% risk of death, respectively (HR for death = 1.10; 95% CI 1.01–1.2, p = 0.03), in the training group. Similar results were obtained in the validation group (HR = 4.62, 95% CI 1.22–17.4, p = 0.02), with a 41.6% and 5.5% risk of death for patients in the high- and low-AFD-value groups, respectively. Univariate and multivariable Cox regression analyses demonstrated that AFD is an independent prognostic model for patients with LUAD. The AUC values for 5-year survival were 0.712 and 0.86 in the training and validation groups, respectively.

Conclusion: AFD was identified as a new independent prognostic model that could provide a prognostic tool for physicians and contribute to treatment decisions.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12935-021-02127-z.

histopathologically classified into two main subtypes: lung squamous cell carcinoma (LUSC) and lung adenocarcinoma (LUAD) [3], of which the latter is the most common, with a 5-year survival rate of approximately 15% [4,5]. These histological subtypes play the main role in determining the therapeutic options. Although patients with NSCLC receive different treatments, whether early-stage surgery or other potentially curative treatments for different stages, the prognosis of patients with early-stage NSCLC remains poor, with a relapse rate of approximately 40% within 5 years [6] and a survival rate of 50-60% [7,8]. This information indicates that some patients in the early stages of the disease are at high risk. Therefore, patients need to be diagnosed early, and reliable prognostic biomarkers or prognostic factors that identify high-risk individuals are urgently needed for NSCLC.
Recent studies, with varying results, have sought to identify prognostic factors and/or prognostic biomarkers for patients with lung adenocarcinoma (LUAD). These biomarkers may be of the following types: (1) biomarkers associated with the risk of developing toxicity to certain medications, namely single nucleotide polymorphism (SNP) haplotypes; (2) biomarkers indicating recurrence of the disease after surgical removal, which are found on the tumor or secreted by it, such as some proteins; (3) genetic mutations targeted by the therapy or the level of gene expression, both of which act as biomarkers; and (4) the number of circulating cancer cells or the metabolic activity of the tumor. Many studies have demonstrated that tumor mutation burden (TMB) is a biomarker for patients with LUAD [9]. For example, Rizvi et al. [10] demonstrated in a retrospective analysis of patients with NSCLC that high TMB levels were correlated with an improved objective response rate (ORR) and prolonged progression-free survival (PFS). Talvitie et al. [11], in a study of patients with lung adenocarcinoma, showed that TMB is an independent biomarker for predicting survival, as patients with TMB greater than or equal to 14 mutations/MB had longer survival than patients with TMB less than 14 mutations/MB. In another study, Jiao et al. [12] reported that TMB was a negative biomarker for predicting survival in patients with LUAD, with low TMB in the group of patients with EGFR mutations. In addition, the change in mean variant allele frequencies (dVAF) has been identified as a predictor of clinical outcomes in NSCLC and UC [13].
Allele frequency deviation (AFD) refers to the degree of deviation between the single nucleotide variant (SNV) allele frequencies in tumor samples and those in matched control samples. It can reflect the disease status of patients: a previous study on AFD in patients with cervical cancer revealed that AFD was positively correlated with therapy response and helped in estimating progression-free survival [14].
Building on previous studies of prognostic biomarkers, particularly the AFD-related study [14], the current study examined the relationship between AFD and overall survival in patients with LUAD. We developed a new algorithm for measuring AFD and then evaluated its performance as an independent prognostic model for predicting the survival of patients with early-stage LUAD. To our knowledge, this is the first study to report a direct association between AFD and patient survival, which may help in the early detection of LUAD and in making effective clinical decisions regarding potential individual treatment.

Data source
The raw whole-exome sequencing (WES) data and clinical information of patients with lung adenocarcinoma were obtained from Fudan University. After excluding patients with insufficient clinical information, 102 patients remained. They were randomly divided into a training group of 54 patients and a validation group of 48 patients. The basic clinical characteristics included in the analysis were smoking history, pT stage, age, sex, and tumor size; details are provided in Table 1. The analysis was carried out on data collected by Fudan University and previously used in another study [15], which was conducted according to ethical standards (Fudan University Shanghai Cancer Center Institutional Review Board No. 090977-1). Informed consent was obtained from patients or their relatives when samples were donated to the tissue bank of Fudan University Shanghai Cancer Center [15]. The data analyzed in our study can be obtained from the European Genome-phenome Archive (EGA) under accession code EGAS00001004006.

Alignment and quality control
In-house pipelines were used to process the WES data of the 102 patients. The quality of the tumor and normal sample data was evaluated using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), including sequence length distribution, GC content, per-base quality, sequence duplication levels, k-mer content, and over-represented sequences [14]. Sequencing reads were aligned to the human reference genome (hg38) using the Burrows-Wheeler Aligner (BWA) software package with default parameters [16]. Reads that mapped to multiple genome positions were removed. The quality of the mapping was then assessed using SAMtools flagstat [17]. All genome sites for somatic variants were called using VarScan2 [18] with a base quality higher than 30 and supporting reads ≥ 200 (Fig. 1).

Calling of SNV from WES
After all reads were mapped to the human reference genome (hg38) using BWA [16], Picard 1.67 was used to mark duplicate reads, and the reads were realigned around known indels. Base quality recalibration was performed using GATK version 3.7 [19]. Somatic mutations were called using Mutect2 after ensuring that the following criteria were met: first, the difference in mutant allele fraction (MAF) between the tumor and normal samples of the same patient was more than 1%; second, the sequencing coverage was more than 200 in both the tumor and normal samples; third, there were more than 10 alternative reads in the tumor sample; and fourth, the corrected p-value was less than 0.05. SNVs were annotated using ANNOVAR against multiple databases [20] and further filtered by population frequency in ExAC, 1000 Genomes, and dbSNP138.
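The four acceptance criteria listed above can be expressed as a single predicate. The following is a minimal sketch, not the authors' pipeline: the `VariantCall` record and its field names are hypothetical, the MAF difference is assumed to be tumor minus normal, and "more than 1%" is assumed to mean more than 0.01 on a 0-1 scale.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    """Hypothetical per-site record; field names are illustrative only."""
    tumor_maf: float      # mutant allele fraction in the tumor sample (0-1)
    normal_maf: float     # mutant allele fraction in the matched normal (0-1)
    tumor_depth: int      # sequencing coverage at the site in the tumor
    normal_depth: int     # sequencing coverage at the site in the normal
    tumor_alt_reads: int  # reads supporting the alternative allele in the tumor
    adjusted_p: float     # corrected p-value for the call


def passes_somatic_filters(v: VariantCall) -> bool:
    """Apply the four criteria from the text to one candidate site."""
    return (
        (v.tumor_maf - v.normal_maf) > 0.01  # assumption: tumor minus normal, > 1%
        and v.tumor_depth > 200              # coverage above 200 in the tumor
        and v.normal_depth > 200             # coverage above 200 in the normal
        and v.tumor_alt_reads > 10           # more than 10 alternative reads
        and v.adjusted_p < 0.05              # corrected p-value below 0.05
    )
```

A site is kept only when all four conditions hold, so a single failing criterion (for example, insufficient depth in the normal sample) is enough to discard it.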

Allele frequency deviation (AFD)
The variant allele frequencies (VAFs) of exome sites for the 102 samples were called using VarScan2 [18] with a base quality higher than 30 and a read depth ≥ 200. The white blood cell (WBC) sample was used as a control to calibrate possible sequencing errors and germline variants during the calculation of the VAF (Fig. 1). The VAFs were then used to calculate AFD for each patient. As displayed in Fig. 2, a scatter plot was first created for all the detected genomic sites of the patient, with the Y axis representing the VAF of the tumor sample and the X axis representing the VAF of the paired normal sample. Second, a diagonal line was drawn, on which points have the same VAF in both samples. The distance from each point to this diagonal line was calculated and defined as d_i for the i-th point. Third, the X,Y coordinates were rotated by −45°; thus, d_i is equal to the absolute value of the rotated Y coordinate of point i and can be calculated using Eq. (1):

$$d_i = \lvert y_i' \rvert = \lvert x_i \sin(-45^\circ) + y_i \cos(-45^\circ) \rvert = \frac{\lvert y_i - x_i \rvert}{\sqrt{2}} \tag{1}$$

where y_i' is the rotated Y-axis value of point i, and x_i and y_i are its original X- and Y-axis values. Finally, the AFD of each patient was calculated as in Eq. (2), where d_i represents the distance of point i from the diagonal line and n represents the total number of points.
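The geometric step described above can be sketched in a few lines. Rotating a point (x, y) by −45° gives a new Y coordinate of (y − x)/√2, so its absolute value is the perpendicular distance from the point to the diagonal y = x. The sketch below computes these distances; because the aggregation formula of Eq. (2) is not reproduced in this text, the `aggregate` argument defaults to the arithmetic mean purely as a placeholder assumption, and the function names are hypothetical.

```python
import math
from statistics import mean
from typing import Callable, List, Sequence


def diagonal_distances(normal_vaf: Sequence[float],
                       tumor_vaf: Sequence[float]) -> List[float]:
    """Distance d_i of each site (x_i = normal VAF, y_i = tumor VAF) from y = x."""
    if len(normal_vaf) != len(tumor_vaf):
        raise ValueError("normal and tumor VAF lists must have the same length")
    # |y'| after a -45 degree rotation equals |y - x| / sqrt(2)
    return [abs(y - x) / math.sqrt(2) for x, y in zip(normal_vaf, tumor_vaf)]


def afd_score(normal_vaf: Sequence[float],
              tumor_vaf: Sequence[float],
              aggregate: Callable[[List[float]], float] = mean) -> float:
    """Summarise the d_i values for one patient.

    ASSUMPTION: the mean is a stand-in; substitute the paper's Eq. (2)
    via the `aggregate` argument.
    """
    distances = diagonal_distances(normal_vaf, tumor_vaf)
    if not distances:
        raise ValueError("at least one genomic site is required")
    return aggregate(distances)
```

Sites with identical VAF in tumor and normal contribute a distance of zero, so only sites that deviate from the diagonal raise the score. The functions are unit-agnostic: VAFs may be given as fractions or percentages as long as both samples use the same scale.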

Tumor mutation burden (TMB)
In short, the tumor mutation burden (TMB) is defined as the total number of somatic nonsynonymous mutations, including small insertions and deletions (INDELs) and single nucleotide variants (SNVs), per megabase [21,22]. The gold standard method of measuring TMB is WES, which can detect somatic mutations across the entire exome and thus gives a comprehensive view of the mutations that can contribute to tumor progression, at a lower cost than whole-genome sequencing (WGS) [23]. The quantile method based on TMB measurements was used to determine the appropriate cutoff values [24].
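The definition above reduces to a simple ratio. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and the size of the covered exome region must be supplied by the caller, since the text does not state the value used in this study.

```python
def tumor_mutation_burden(n_nonsynonymous: int, covered_bases: int) -> float:
    """Somatic nonsynonymous mutations (SNVs plus small INDELs) per megabase.

    `covered_bases` is the size of the sequenced region in base pairs;
    its value is an input here, not something taken from the paper.
    """
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return n_nonsynonymous / (covered_bases / 1_000_000)
```

For instance, 300 qualifying mutations over a hypothetical 30 Mb covered region would give a TMB of 10 mutations/MB.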

Statistical analysis
The Spearman correlation test was conducted to determine the correlation between factors such as AFD and TMB. Kaplan-Meier (K-M) analysis was used to evaluate the differences in survival time between the high- and low-AFD-value groups of patients with LUAD. The P values and HRs (95% confidence interval [CI]) were determined via the log-rank test and univariate Cox regression analysis to detect significant differences between the groups. Multivariable Cox regression analysis was performed to evaluate the independence of AFD. The ROC curve was used to estimate the performance of AFD by comparing AUC values. Statistical significance was defined as P ≤ 0.05. All statistical analyses were performed using R version 3.5.1.
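To make the survival estimate concrete, the sketch below implements the standard Kaplan-Meier product-limit estimator in plain Python. It is an illustration of the technique only; the study itself used R, and the function name and input layout here are assumptions.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float],
                 events: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct death time.

    events[i] is True if patient i died at times[i] and False if the
    patient was censored (still alive at last follow-up).
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored at t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve
```

Censored patients never lower the curve directly; they only shrink the number at risk for later time points, which is what distinguishes this estimate from a naive proportion of survivors.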

Patient characteristics
The main histological subtype in this study was lung adenocarcinoma (LUAD; Table S1). The patients had not received any neoadjuvant treatment.

Relationship between AFD and TMB
To determine whether AFD and TMB are related, we performed a Spearman correlation test. Figure 3A shows the correlation between AFD and TMB in patients with LUAD in the training group. The p-value of the test was above the significance level of 0.05; AFD and TMB were therefore not significantly associated (correlation coefficient = 0.16, p = 0.26).
In the validation group, the result likewise showed no correlation between AFD and TMB (correlation coefficient = −0.077, p = 0.6; Fig. 3B).
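The Spearman coefficient reported above is the Pearson correlation of the ranks of the two variables. A minimal dependency-free sketch follows; it computes the coefficient only (not the p-value), uses average ranks for ties, and its function names are illustrative rather than taken from the study's R code.

```python
import math
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation between two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (sx * sy)
```

Because only ranks enter the calculation, the coefficient is insensitive to the very different scales of AFD and TMB values.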

Allele frequency deviation shows strong power to predict patient outcomes
A time-dependent ROC curve was used to evaluate the sensitivity and specificity of AFD and TMB for OS prediction in the training and validation groups. In the training group, AFD and TMB achieved almost the same AUC values of 0.713 and 0.721, respectively (Fig. 4C and D), while in the validation group, AFD achieved an AUC of 0.86 and TMB achieved 0.65 (Fig. 5C and D). These AUC values demonstrate that AFD has good prognostic performance for predicting the survival of patients with LUAD.
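As a reminder of what an AUC value measures, the sketch below computes it as the probability that a randomly chosen patient who died has a higher score than a randomly chosen patient who survived. This is a deliberately simplified illustration: it treats the outcome at a fixed horizon as a plain binary label and ignores censoring, whereas the time-dependent ROC analysis used in the study accounts for follow-up time. The function name is hypothetical.

```python
from typing import Sequence


def auc(scores: Sequence[float], died: Sequence[bool]) -> float:
    """Rank-based AUC: P(score of a death > score of a survivor), ties count 0.5."""
    if len(scores) != len(died):
        raise ValueError("scores and labels must have the same length")
    pos = [s for s, d in zip(scores, died) if d]
    neg = [s for s, d in zip(scores, died) if not d]
    if not pos or not neg:
        raise ValueError("both outcome classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to a score with no discriminative value, and 1.0 to a score that ranks every death above every survivor.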

Overall survival
TMB and AFD are continuous variables, and cutoff points for them have not been uniformly established. In our study, we therefore assumed that the risk of death increases with rising AFD values, and we used the quantile method to derive a cutoff point from the AFD values that separates patients with high AFD values (high-risk group) from those with low AFD values (low-risk group). In the training set, the mean value of AFD was 13 (Table 2). Survival in the high-AFD-value group decreased gradually from 78.6% at 12 months to 52.2% at 35 months. In the training group, the OS of patients in the low-AFD-value (low-risk) group was significantly longer than that of patients in the high-AFD-value (high-risk) group, with a risk of death of 10% and 42.8%, respectively (HR for death = 1.10; 95% CI 1.01-1.2, p = 0.03) (Tables 2 and 4). According to the cutoff point, 14 and 40 patients were included in the survival analysis in the high- and low-AFD-value groups, respectively. In the validation group, OS was likewise significantly longer in the low-AFD-value (low-risk) group than in the high-AFD-value (high-risk) group, with a risk of death of 5.5% and 41.6%, respectively (HR = 3.1, 95% CI 1.4-6.60, p = 0.003) (Tables 3 and 4). According to the cutoff point, 12 and 36 patients were included in the survival analysis in the high- and low-AFD-value groups, respectively. The one-sided stratified log-rank p-values were 0.0064 (Fig. 4A) and 0.0013 (Fig. 5A) for the training and validation groups, respectively, indicating a significant difference between the two groups regardless of the number of patients in each group. Overall, patients with high AFD values were at higher risk of death than patients with low AFD values.
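The quantile-based grouping can be sketched as follows. The text does not state which quantile was used; the group sizes reported above (14 of 54 and 12 of 48 patients in the high group) are roughly a quarter of each cohort, so this sketch assumes the upper quartile as the cutoff. That choice, the function name, and the patient identifiers are all assumptions for illustration.

```python
from statistics import quantiles
from typing import Dict, List, Tuple


def split_high_low(afd_by_patient: Dict[str, float],
                   n_quantiles: int = 4) -> Tuple[float, List[str], List[str]]:
    """Split patients at the top quantile cut point of their AFD values.

    ASSUMPTION: the highest of the n_quantiles cut points (the upper
    quartile by default) is the cutoff; the paper's exact choice is not given.
    """
    cutoff = quantiles(list(afd_by_patient.values()), n=n_quantiles)[-1]
    high = sorted(p for p, v in afd_by_patient.items() if v > cutoff)
    low = sorted(p for p, v in afd_by_patient.items() if v <= cutoff)
    return cutoff, high, low
```

Deriving the cutoff from the distribution of the values themselves avoids hand-picking a threshold, although the resulting groups are unequal in size by construction.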
The Kaplan-Meier curve for TMB in the training group showed that patients in the high-level group had significantly shorter OS than those in the low-level group, with 35.7% higher risk of death (HR = 1.08, 95% CI 0.96-1.2, p = 0.17). The OS at 31 months was 62.5% (95% CI 41-95.3) in the high-level TMB group and 89.9% (95% CI 80.9-99.8) in the low-level TMB group (Tables 2 and 4). The number of patients in the high-level group was 40, while it was 14 in the low-level group. The one-sided stratified log-rank p-value was 0.03, indicating a difference in OS between the two groups (Fig. 4B). In the validation group, no significant differences were found between the two groups in the Kaplan-Meier curve (Fig. 5B). The numbers of patients in the high- and low-level groups were 36 and 12, respectively.

AFD as an independent prognostic factor
Univariate and multivariable Cox regression analyses were conducted in the training and validation groups to assess whether AFD is an independent prognostic factor for patients with LUAD. AFD and other clinicopathological factors, including sex, smoking, age, pT, and tumor size, were used as covariates. In the training group, univariate regression analysis indicated that AFD (p = 0.03) was significantly associated with patient survival, while sex (p = 0.47), age (p = 0.31), tumor size (p = 0.28), smoking (p = 0.22), pT (p = 0.68), and TMB (p = 0.17) were not (Table 4). In the validation group, AFD (p = 0.003) was the only factor correlated with patient survival; the other clinical factors did not show any association with patient survival (Table 4). The corresponding multivariable Cox regression analysis confirmed that AFD was an independent prognostic factor in both the training (HR = 1.125, 95% CI 1.001-1.26, p = 0.04) and validation (HR = 4.62, 95% CI 1.22-17.4, p = 0.02) groups (Table 4). These results show that AFD is an independent risk factor that could be used as a prognostic tool to assist in the early diagnosis of patients with LUAD.
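For readers less familiar with how the hazard ratios above arise, the textbook form of the Cox proportional hazards model is given below. This is the general formulation, not a reproduction of the authors' fitted model: the symbols are introduced here for illustration, and how each covariate was coded (continuous or categorical) is not stated in this text.

```latex
h(t \mid x) = h_0(t)\,\exp\!\Big(\beta_{\mathrm{AFD}}\,x_{\mathrm{AFD}} + \sum_{k} \beta_k x_k\Big),
\qquad
\mathrm{HR}_{\mathrm{AFD}} = \exp(\beta_{\mathrm{AFD}})
```

Here h_0(t) is the baseline hazard and the x_k are the remaining covariates (sex, smoking, age, pT, tumor size). Under this model, the reported multivariable HR of 1.125 in the training group corresponds to a coefficient of ln(1.125) ≈ 0.118, and a confidence interval that excludes 1 (here 1.001-1.26) is what supports calling the factor independently associated with survival.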

Discussion
Survival time differs among patients according to the stage of LUAD, as this type of cancer is heterogeneous. Many clinical variables have been widely studied for predicting the diagnosis and treatment outcomes of patients with LUAD, but the results are uneven. The most important patient-related factors are TNM stage, race, age, tumor size, and sex. Tumor-related factors, including the invasion of blood vessels and cell differentiation, also contribute to the prediction of outcomes and treatment of patients [25][26][27][28][29].
In the current study, patients with high AFD values were assumed to be at higher risk than those with low AFD values; AFD may therefore act as an indicator of disease progression and patient survival. To test this, the patients were divided into two groups, one with high AFD values and one with low AFD values. The quantile method was used to obtain an appropriate cutoff point so that patients were separated into the two groups in an unbiased manner. With this cutoff value, a significant difference was obtained between the high- and low-risk groups. Thus, AFD had a clear effect in predicting the survival of patients and identifying those at high risk. Multivariable Cox regression analysis showed that AFD is an independent prognostic tool capable of predicting survival in patients with LUAD. In addition, ROC analysis showed that AFD has effective power to predict the overall survival of patients.
Previous studies have shown that TMB was significantly correlated with immune checkpoint inhibitors (ICIs), such as PD-L1 and PD-1 inhibitors, and with other biomarkers, including EGFR and TP53 [30][31][32]. In the present research, the relationship between AFD and TMB was evaluated, and the results showed no correlation between the two. Furthermore, in the training group the AUC values for predicting patient survival were high and almost the same for AFD and TMB, suggesting that AFD is no less efficient than TMB in predicting overall survival. These results are consistent with the findings of the Kaplan-Meier analysis for patients with LUAD, in which AFD showed high statistical significance: when the patients were divided by AFD into high- and low-value risk groups, the patients with high AFD values had shorter OS than those with low AFD values. In contrast, univariate and multivariable Cox regression analyses showed that TMB tended to be a non-independent prognostic factor for predicting the survival of patients with LUAD, and no significant association was observed between TMB and the survival of patients with LUAD. This finding is consistent with that of previous studies [33,34], which showed that TMB was significantly related to the prediction of patient response to medications and thus to determining their effectiveness. Interestingly, AFD displayed efficiency and predictive ability in both analyses and emerged as an independent prognostic factor.
A number of studies have reported that tumor size is a prognostic factor used to predict patient progression and outcomes [35]. A previous study related to AFD demonstrated the effectiveness of AFD in predicting the benefit and response of patients with cervical cancer to treatment, and the predicted evidence of metastases was better than that of tumor size [14]. In the present study, AFD was shown to be independent of tumor size, and patients with high AFD values had worse prognosis than patients with low AFD values. Therefore, AFD can be considered as a prognostic factor for predicting the outcome of patients with LUAD, consequently suggesting the use of AFD in clinical application for the purpose of early diagnosis of lung adenocarcinoma patients.
AFD is still a new model that has not yet been used as a prognostic model for the prediction of clinical outcomes in lung adenocarcinoma or any other type of cancer. Therefore, this study is the first to show that AFD is effective as an independent prognostic model with the predictive power to identify high-risk groups of patients with LUAD. In addition, these results may indicate a more fundamental role for AFD in early LUAD detection and accurate survival prediction. However, this study has limitations. First, the number of samples was small; this limitation could be addressed by conducting a study with a larger number of patients. AFD could also be applied to measure the effectiveness of medicines by assessing the response of patients who received particular treatments. In addition, as a prognostic model, AFD can be applied in further cancer research to verify it in different types of cancer.

Conclusion
In conclusion, we developed a new prognostic model by devising a new algorithm to calculate the allele frequency deviation (AFD), which showed effective performance in predicting the survival of patients with LUAD. Furthermore, AFD is an independent prognostic tool for predicting survival in patients with LUAD. The study results provide evidence that AFD could be used in the early diagnosis of patients with LUAD. It may therefore be possible to apply AFD clinically as a new prognostic tool to predict patient outcomes, contribute to follow-up monitoring, and help clinicians make effective decisions regarding the potential individual treatment of patients with LUAD, which could improve their survival. Despite these findings, the model needs further investigation and application in other types of cancer.